Antipattern: Format Comma-Separated Lists
Let's explore Jaywalking antipattern in more detail.
Returning to our example, to minimize changes to the database structure, you decide to redefine the
account_id
column as a VARCHAR
so you can list multiple account IDs in that column, separated by commas.
This seems like a win because you’ve created no additional tables or columns; you’ve changed only one column’s data type. However, let’s look at the performance and data integrity problems this table design suffers from.
Querying products for a specific account#
Although a foreign key can handle the insert
, update
, and delete
functions, the queries are difficult if all the foreign keys are combined into a single field. Thus, you can no longer use equality; instead, you have to use a test against some kind of pattern. For example, MySQL lets you write something like the following to find all the products for account 12
:
Note: We can also use some other queries for data retrieval. For example, you can check it for accounts 34 or 23.
Pattern-matching expressions may return false matches and can’t benefit from indexes. Since the pattern-matching syntax is different in each database brand, your SQL code isn’t vendor-neutral.
Querying accounts for a given product#
Likewise, joining a comma-separated list to the matching rows in the referenced table is awkward and costly.
Let’s run the query in the following widget. Try to run this query after making some changes, like using a different regular expression.
We can also retrieve the available data by using SELECT * FROM Products
before using the code given in the following widget to retrieve the already available data in the database.
Joining two tables using an expression like this one spoils any chance of using indexes. The query must scan through both tables, generate a cross product, and evaluate the regular expression for every combination of rows.
If we use regular expressions to pick out parts of a string, this could be an indication that we should store those parts separately. Otherwise, this solution would be an antipattern.
Making aggregate queries#
Aggregate queries use functions like COUNT()
, SUM()
, and AVG()
. However, these functions are designed to be used over groups of rows, not comma-separated lists. We have to resort to tricks like the following:
Tricks like this can be clever but never clear. These kinds of solutions are time-consuming to develop and hard to debug. Some aggregate queries can’t be accomplished with tricks at all.
Updating accounts for a specific product#
We can also add a new ID to the end of the list with string concatenation, but this may affect the sorting of the list.
Let’s check in the following widget if the data is updated successfully.
To remove an item from the list, we have to run two SQL queries: one to fetch the old list and a second to save the updated list.
<?php
$stmt = $pdo->query(
"SELECT account_id FROM Products WHERE product_id = 123");
$row = $stmt->fetch();
$contact_list = $row['account_id'];
// change list in PHP code
$value_to_remove = "34";
$contact_list = split(",", $contact_list);
$key_to_remove = array_search($value_to_remove, $contact_list);
unset($contact_list[$key_to_remove]);
$contact_list = join(",", $contact_list);
$stmt = $pdo->prepare(
"UPDATE Products SET account_id = ?
WHERE product_id = 123");
$stmt->execute(array($contact_list));
?>
That’s quite a lot of code just to remove an entry from a list.
Validating product IDs#
Since we have already seen that we can query the available data in different ways, what prevents a user from entering invalid entries like “banana”?
Let’s try to run the following widget to see the changes made.
Users will find a way to enter any and all variations, and the database will turn to mush. There won’t necessarily be database errors, but the data will become nonsense.
Choosing a separator character#
Some list entries may contain our separator character if we store a list of string
values instead of integers
. Using a comma as the separator between entries may become ambiguous. We can choose a different character as the separator. Still, can we guarantee that this new separator will never appear in an entry?
For example, you use a semicolon instead of a comma. How can you be sure that this will never appear in another entry?
List length limitations#
Here comes another problem. How many list entries can we store in a VARCHAR(30)
column? It depends on the length of each entry. If each entry is two characters long, then we can store ten entries (since the comma is also counted as a character). But if each entry is six characters, then we can only store four entries:
Let’s try adding some more values in the account_id
column.
Let’s add 10,14,18,22,26,30,34,38,42,46,48,37
and/or 101418,222630,343842,467790,765439
in the following playground to check if it works.
How can we know that VARCHAR(30)
supports the longest list that we will need in the future? How long is long enough?
Try explaining the reason for this length limit to a boss or to customers.